# Create a code chunk and set your working directory
setwd("~/Downloads/CapstoneProject-02")

INTRODUCTION

OVERVIEW OF THE BELLABEAT COMPANY

Bellabeat is a wellness brand for women, offering an ecosystem of products and services focused on women’s health. Founded by Urška Sršen and Sando Mur, Bellabeat wants to analyze smart device data to uncover new growth opportunities and inform its marketing strategy.

STAKEHOLDERS

The stakeholders include:

1. Urška Sršen, Bellabeat co-founder and Chief Executive Officer

2. Sando Mur, Mathematician and Bellabeat’s Co-founder

3. Bellabeat’s Marketing Analytics Team

Business Task

Bellabeat aims to analyze the usage data from one of its products to gain insights and make high-level recommendations that will inform its marketing strategy.

Questions for Analysis

  1. What are some trends found in smart device usage?
  2. How could these trends affect Bellabeat customers?
  3. How could these trends impact Bellabeat’s marketing strategy?

DATA PREPARATION

Data Source:

The FitBit Fitness Tracker Data dataset by Mobius, released under the CC0: Public Domain license, was used. This dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Around 30 eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.

Sorting & Filtering

This analysis focuses on usage trends, so it concentrates on the tables that reflect user engagement.

Load Packages

library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(tidyr)

Importing the Datasets

DailyActivity_01 <- read_csv("archive/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/dailyActivity_merged.csv")
## Rows: 457 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
DailyActivity_02 <- read_csv("archive/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
DailySleep <- read_csv("archive/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
HourlySteps_01 <- read_csv("archive/mturkfitbit_export_3.12.16-4.11.16/Fitabase Data 3.12.16-4.11.16/hourlySteps_merged.csv")
## Rows: 24084 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
HourlySteps_02 <- read_csv("archive/mturkfitbit_export_4.12.16-5.12.16/Fitabase Data 4.12.16-5.12.16/hourlySteps_merged.csv")
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Preview Data

To confirm that the files loaded successfully, here are the first few rows of one of the dataframes, DailyActivity_01.

head(DailyActivity_01)
## # A tibble: 6 × 15
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
##        <dbl> <chr>             <dbl>         <dbl>           <dbl>
## 1 1503960366 3/25/2016         11004          7.11            7.11
## 2 1503960366 3/26/2016         17609         11.6            11.6 
## 3 1503960366 3/27/2016         12736          8.53            8.53
## 4 1503960366 3/28/2016         13231          8.93            8.93
## 5 1503960366 3/29/2016         12041          7.85            7.85
## 6 1503960366 3/30/2016         10970          7.16            7.16
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>

Check Sample Size

Next, let’s check the sample size of each dataframe.

# Checking distinct values to find out the sample size
n_distinct(DailyActivity_01$Id)
## [1] 35
n_distinct(DailyActivity_02$Id)
## [1] 33
n_distinct(DailySleep$Id)
## [1] 24
n_distinct(HourlySteps_01$Id)
## [1] 34
n_distinct(HourlySteps_02$Id)
## [1] 33

From this information, most dataframes contain between 33 and 35 participants; however, the DailySleep dataframe contains only 24 users.
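To see which users are missing from the sleep data, the Id columns can be compared directly. The following is a minimal sketch using base R and hypothetical Id values; in the notebook the two vectors would come from DailyActivity_02$Id and DailySleep$Id.

```r
# Hypothetical Id vectors standing in for DailyActivity_02$Id and DailySleep$Id
activity_ids <- c(1001, 1002, 1003, 1004)
sleep_ids    <- c(1001, 1003)

# Ids present in the activity data but absent from the sleep data
missing_sleep <- setdiff(activity_ids, sleep_ids)
missing_sleep

# Share of activity users who also have at least one sleep record
sleep_coverage <- mean(activity_ids %in% sleep_ids)
sleep_coverage
```

A low coverage value would be a reason to treat any sleep findings as describing a subset of users rather than the whole sample.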

DATA CLEANING

Merge Datasets

DailyActivity_01 and DailyActivity_02 have the same structure and differ only in the time period they cover, so I merge them into a single dataframe named ‘DailyActivity’. The same is true of HourlySteps_01 and HourlySteps_02, which I merge into a single dataframe named ‘HourlySteps’.

# Merge DailyActivity_01 and DailyActivity_02 together
DailyActivity <- rbind(DailyActivity_01, DailyActivity_02)

# Merge HourlySteps_01 and HourlySteps_02 together
HourlySteps <- rbind(HourlySteps_01, HourlySteps_02)
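Before stacking two exports, it can help to confirm that their columns really do match and to see whether the periods overlap. The sketch below uses two small hypothetical data frames (period_1 and period_2 are illustrative names, not objects from this notebook).

```r
# Hypothetical frames standing in for the two export periods
period_1 <- data.frame(Id = c(1, 2),
                       ActivityDate = c("4/11/2016", "4/12/2016"),
                       TotalSteps = c(5000, 6200))
period_2 <- data.frame(Id = c(2, 3),
                       ActivityDate = c("4/12/2016", "4/13/2016"),
                       TotalSteps = c(7100, 8000))

# rbind() on data frames matches columns by name, so names and types should agree
same_names <- identical(names(period_1), names(period_2))
same_types <- identical(sapply(period_1, class), sapply(period_2, class))
stopifnot(same_names, same_types)

combined <- rbind(period_1, period_2)

# Rows sharing the same user and date hint at overlapping export periods
overlap <- sum(duplicated(combined[, c("Id", "ActivityDate")]))
overlap
```

A non-zero overlap count at this stage signals that a de-duplication step will be needed after the merge.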

I discovered an issue with the data types of the ‘ActivityDate’ column in the DailyActivity dataframes, the ‘ActivityHour’ column in the HourlySteps dataframes, and the ‘SleepDay’ column in the DailySleep dataframe: all of them are stored as character. Before proceeding with further analysis, I need to convert them to a date or date-time format.

Create Clean Data Frames

Before proceeding, I created a copy of each dataframe, named with the prefix Clean_ (for example, Clean_DailyActivity), so that the original data remains unchanged.

# Create a new data frame for DailyActivity
Clean_DailyActivity <- DailyActivity

# Verify
head(Clean_DailyActivity)
## # A tibble: 6 × 15
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
##        <dbl> <chr>             <dbl>         <dbl>           <dbl>
## 1 1503960366 3/25/2016         11004          7.11            7.11
## 2 1503960366 3/26/2016         17609         11.6            11.6 
## 3 1503960366 3/27/2016         12736          8.53            8.53
## 4 1503960366 3/28/2016         13231          8.93            8.93
## 5 1503960366 3/29/2016         12041          7.85            7.85
## 6 1503960366 3/30/2016         10970          7.16            7.16
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>

Format Dates

library(lubridate)

# Convert ActivityDate to proper date format
Clean_DailyActivity$ActivityDate <- mdy(Clean_DailyActivity$ActivityDate)

# Check the results
head(Clean_DailyActivity$ActivityDate)
## [1] "2016-03-25" "2016-03-26" "2016-03-27" "2016-03-28" "2016-03-29"
## [6] "2016-03-30"
class(Clean_DailyActivity$ActivityDate)
## [1] "Date"

Now the data type of the ‘ActivityDate’ column in the Clean_DailyActivity frame is changed to Date.
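After a conversion like this, it is worth checking that no value silently failed to parse. The sketch below uses base R's as.Date() as a stand-in for mdy() on the same month/day/year layout; the input strings are hypothetical, and the last one is deliberately malformed.

```r
# Hypothetical raw values in the same m/d/Y layout as ActivityDate
raw_dates <- c("3/25/2016", "4/12/2016", "not a date")

# Base R equivalent of the conversion for this specific layout
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Values that were present as text but became NA after parsing
n_failed <- sum(is.na(parsed) & !is.na(raw_dates))
n_failed

# Inspect the offending strings, if any
raw_dates[is.na(parsed)]
```

A failure count of zero on the real column would confirm that every row uses the expected date layout.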

Next, it’s time to change the data type of the ‘ActivityHour’ column in the HourlySteps dataframe, but first let’s create a new dataframe named ‘Clean_HourlySteps’ so that the original dataframe remains the same.

# Create a new data frame for HourlySteps
Clean_HourlySteps <- HourlySteps

# Verify
head(Clean_HourlySteps)
## # A tibble: 6 × 3
##           Id ActivityHour          StepTotal
##        <dbl> <chr>                     <dbl>
## 1 1503960366 3/12/2016 12:00:00 AM         0
## 2 1503960366 3/12/2016 1:00:00 AM          0
## 3 1503960366 3/12/2016 2:00:00 AM          0
## 4 1503960366 3/12/2016 3:00:00 AM          0
## 5 1503960366 3/12/2016 4:00:00 AM          0
## 6 1503960366 3/12/2016 5:00:00 AM          0

Now, convert the column to a date-time (POSIXct).

# Convert ActivityHour to proper date format
Clean_HourlySteps <- Clean_HourlySteps %>%
  mutate(
    ActivityHour = mdy_hms(ActivityHour),  # converts "3/12/2016 10:00:00 AM"
    hour = hour(ActivityHour)
  )

# Check the results
head(Clean_HourlySteps$ActivityHour)
## [1] "2016-03-12 00:00:00 UTC" "2016-03-12 01:00:00 UTC"
## [3] "2016-03-12 02:00:00 UTC" "2016-03-12 03:00:00 UTC"
## [5] "2016-03-12 04:00:00 UTC" "2016-03-12 05:00:00 UTC"
class(Clean_HourlySteps$ActivityHour)
## [1] "POSIXct" "POSIXt"

Now the data type of the ‘ActivityHour’ column in the Clean_HourlySteps data frame is changed to POSIXct.

Last, I need to change the data type of the ‘SleepDay’ column from character to a date-time. As with the other two dataframes, I first create a copy so that the original dataframe keeps its values.

# Create a new data frame for DailySleep
Clean_DailySleep <- DailySleep

# Verify
head(Clean_DailySleep)
## # A tibble: 6 × 5
##           Id SleepDay        TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                       <dbl>              <dbl>          <dbl>
## 1 1503960366 4/12/2016 12:0…                 1                327            346
## 2 1503960366 4/13/2016 12:0…                 2                384            407
## 3 1503960366 4/15/2016 12:0…                 1                412            442
## 4 1503960366 4/16/2016 12:0…                 2                340            367
## 5 1503960366 4/17/2016 12:0…                 1                700            712
## 6 1503960366 4/19/2016 12:0…                 1                304            320

Now, convert the column to a date-time (POSIXct).

# Convert SleepDay to proper date format
Clean_DailySleep <- Clean_DailySleep %>%
  mutate(
    SleepDay = mdy_hms(SleepDay),  # parses strings such as "4/12/2016 12:00:00 AM"
    hour = hour(SleepDay)
  )

# Check the results
head(Clean_DailySleep$SleepDay)
## [1] "2016-04-12 UTC" "2016-04-13 UTC" "2016-04-15 UTC" "2016-04-16 UTC"
## [5] "2016-04-17 UTC" "2016-04-19 UTC"
class(Clean_DailySleep$SleepDay)
## [1] "POSIXct" "POSIXt"

Now the data type of the ‘SleepDay’ column in the Clean_DailySleep data frame is changed to POSIXct.
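Because the SleepDay values printed above carry no time of day, a plain Date may be easier to compare with ActivityDate later. The following sketch, on hypothetical timestamps, first checks that every value falls at midnight and only then drops the time component; sleep_day and sleep_date are illustrative names.

```r
# Hypothetical midnight timestamps standing in for Clean_DailySleep$SleepDay
sleep_day <- as.POSIXct(c("2016-04-12 00:00:00", "2016-04-13 00:00:00"), tz = "UTC")

# TRUE only if no row carries a real time of day
all_midnight <- all(format(sleep_day, "%H:%M:%S") == "00:00:00")
all_midnight

# Safe to drop the time component once the check passes
sleep_date <- as.Date(sleep_day, tz = "UTC")
class(sleep_date)
```

If the check passes on the real data, the derived hour column holds only zeros and carries no information for the sleep table.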

After that, I need to find out whether there are duplicate records within these 3 dataframes (Clean_DailyActivity, Clean_DailySleep, and Clean_HourlySteps).

Check Duplicate

# Check for duplicates based on user + date in the DailyActivity dataframe
sum(duplicated(Clean_DailyActivity[, c("Id", "ActivityDate")]))
## [1] 24

It turned out that 24 duplicate rows (same user and date) appear in the Clean_DailyActivity dataframe, and the copies hold different values. For each duplicated user-day, I keep the row with the highest TotalSteps and remove the others.

# Keep the row with max steps or values
Clean_DailyActivity <- Clean_DailyActivity %>%
  group_by(Id, ActivityDate) %>%
  slice_max(TotalSteps, n = 1, with_ties = FALSE) %>%
  ungroup()
# Check for duplicates based on user + date in the DailySleep dataframe
sum(duplicated(Clean_DailySleep[, c("Id", "SleepDay")]))
## [1] 3

Checking the Clean_DailySleep dataframe shows 3 duplicate rows. Unlike in Clean_DailyActivity, the duplicates in this dataframe have identical values, so I removed them and kept only the first occurrence.

# Keep only the first occurrence, and remove the rest
Clean_DailySleep <- Clean_DailySleep %>%
  group_by(Id, SleepDay) %>%
  slice(1) %>%
  ungroup()
# Checking any duplicate based on user + date on Clean_HourlySteps dataframe
sum(duplicated(Clean_HourlySteps[, c("Id", "ActivityHour")]))
## [1] 175

There are 175 duplicate rows in the Clean_HourlySteps dataframe, with identical values, just like in the previous dataframe. The next step is to remove the duplicates and keep only the first occurrence.

# Keep only the first occurrence, and remove the rest
Clean_HourlySteps <- Clean_HourlySteps %>%
  group_by(Id, ActivityHour) %>%
  slice(1) %>%
  ungroup()

Now, none of the dataframes contains duplicate data.
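The keep-the-largest rule used for the daily activity data can also be written in base R, which makes it easy to inspect the conflicting rows before dropping any of them. This is a sketch on hypothetical records, not the notebook's own data.

```r
# Hypothetical records: user 1 has two conflicting rows for the same day
records <- data.frame(
  Id = c(1, 1, 2, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-12", "2016-04-12", "2016-04-13")),
  TotalSteps = c(224, 13162, 8000, 9000)
)

# Every row that belongs to a duplicated user-day, including the first copy
key <- records[, c("Id", "ActivityDate")]
conflicts <- records[duplicated(key) | duplicated(key, fromLast = TRUE), ]
conflicts

# Sort so the largest TotalSteps comes first within each user-day, then keep that row
sorted <- records[order(records$Id, records$ActivityDate, -records$TotalSteps), ]
deduped <- sorted[!duplicated(sorted[, c("Id", "ActivityDate")]), ]

# Confirm that no user-day appears twice any more
remaining <- sum(duplicated(deduped[, c("Id", "ActivityDate")]))
remaining
```

Looking at the conflicts table first shows whether the copies differ only slightly or whether one of them is clearly a partial-day record.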

The next step is to check for missing values in each dataframe.

Check Missing Values

# 1. DailyActivity
colSums(is.na(Clean_DailyActivity))
##                       Id             ActivityDate               TotalSteps 
##                        0                        0                        0 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                        0                        0                        0 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                        0                        0                        0 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                        0                        0                        0 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                        0                        0                        0
colMeans(is.na(Clean_DailyActivity)) * 100
##                       Id             ActivityDate               TotalSteps 
##                        0                        0                        0 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                        0                        0                        0 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                        0                        0                        0 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                        0                        0                        0 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                        0                        0                        0
# 2. DailySleep
colSums(is.na(Clean_DailySleep))
##                 Id           SleepDay  TotalSleepRecords TotalMinutesAsleep 
##                  0                  0                  0                  0 
##     TotalTimeInBed               hour 
##                  0                  0
colMeans(is.na(Clean_DailySleep)) * 100
##                 Id           SleepDay  TotalSleepRecords TotalMinutesAsleep 
##                  0                  0                  0                  0 
##     TotalTimeInBed               hour 
##                  0                  0
# 3. HourlySteps
colSums(is.na(Clean_HourlySteps))
##           Id ActivityHour    StepTotal         hour 
##            0            0            0            0
colMeans(is.na(Clean_HourlySteps)) * 100
##           Id ActivityHour    StepTotal         hour 
##            0            0            0            0

Based on this output, none of the dataframes contains missing values, so the data is now clean and ready for analysis.
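An NA count does not catch days on which the device recorded nothing at all, since those appear as zeros rather than missing values. The sketch below counts such days on hypothetical data; treating a day with 0 steps and 1,440 sedentary minutes as a day the device was not worn is an assumption, not something the dataset states.

```r
# Hypothetical daily rows standing in for Clean_DailyActivity
activity <- data.frame(
  TotalSteps       = c(0, 8500, 0, 12000),
  SedentaryMinutes = c(1440, 700, 1440, 650)
)

# Days with no recorded steps
zero_step_days <- sum(activity$TotalSteps == 0)
zero_step_days

# Assumption: zero steps plus a full 1,440 sedentary minutes suggests the device was not worn
likely_not_worn <- activity$TotalSteps == 0 & activity$SedentaryMinutes == 1440
pct_not_worn <- mean(likely_not_worn) * 100
pct_not_worn
```

If the share is large on the real data, averages computed over all days will understate how active users are on days they actually wear the device.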

ANALYSIS

Let’s have a look at the summary statistics of the DailyActivity dataset to find out the overall activity patterns.

Overall Activity

Summary 1

# Core Activity Summary
Clean_DailyActivity %>%  
  select(TotalSteps,
         TotalDistance,
         SedentaryMinutes,
         LightlyActiveMinutes,
         FairlyActiveMinutes,
         VeryActiveMinutes,
         Calories) %>%
  summary()
##    TotalSteps    TotalDistance    SedentaryMinutes LightlyActiveMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0     Min.   :  0.0       
##  1st Qu.: 3321   1st Qu.: 2.280   1st Qu.: 734     1st Qu.:117.0       
##  Median : 7142   Median : 5.030   Median :1062     Median :196.0       
##  Mean   : 7377   Mean   : 5.289   Mean   :1001     Mean   :188.1       
##  3rd Qu.:10645   3rd Qu.: 7.570   3rd Qu.:1246     3rd Qu.:263.0       
##  Max.   :36019   Max.   :28.030   Max.   :1440     Max.   :720.0       
##  FairlyActiveMinutes VeryActiveMinutes    Calories   
##  Min.   :  0.0       Min.   :  0.00    Min.   :   0  
##  1st Qu.:  0.0       1st Qu.:  0.00    1st Qu.:1820  
##  Median :  6.0       Median :  2.00    Median :2129  
##  Mean   : 13.6       Mean   : 19.87    Mean   :2295  
##  3rd Qu.: 18.0       3rd Qu.: 30.00    3rd Qu.:2781  
##  Max.   :660.0       Max.   :210.00    Max.   :4900

Daily Activity Summary Key Insights

After analyzing 1,373 daily activity records (1,397 merged rows minus 24 duplicates) from 35 users, several key patterns emerged:

1. Overall Activity Levels - Users averaged 7,377 steps per day, which is below the commonly recommended 10,000 steps. The median of 7,142 steps means that half of all recorded days had 7,142 steps or fewer, suggesting significant room for improvement in daily activity levels.

2. Sedentary Lifestyle Concerns - Perhaps most concerning, users were sedentary for an average of 16.7 hours per day. This high sedentary time is consistent with desk-based work environments but presents health risks and a key opportunity for intervention.

3. Exercise Intensity - While users averaged 188 minutes of light activity daily, they achieved only about 34 minutes of moderate-to-vigorous physical activity (fairly + very active minutes combined). On average this exceeds the WHO minimum recommendation of 150 minutes per week (about 21 minutes per day), but the medians of 6 fairly active and 2 very active minutes show that most recorded days include far less, so most users could benefit from more intense exercise.
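The figures quoted in these insights follow directly from the means in the summary table. As a quick check, the arithmetic can be reproduced as below; the variable names are illustrative, and the 150 minutes per week is the WHO minimum for moderate-intensity activity.

```r
# Means copied from the summary() output above
mean_sedentary_min <- 1001
mean_fairly_active <- 13.6
mean_very_active   <- 19.87

# Sedentary time in hours per day (about 16.7)
sedentary_hours <- mean_sedentary_min / 60
round(sedentary_hours, 1)

# Moderate-to-vigorous minutes per day, averaged over days (about 33.5)
mvpa_minutes <- mean_fairly_active + mean_very_active
round(mvpa_minutes, 1)

# WHO weekly minimum spread evenly over 7 days (about 21.4)
who_daily_minimum <- 150 / 7
round(who_daily_minimum, 1)
```

The day-level figure of about 33.5 minutes is slightly below the 34 minutes obtained when the average is taken per user first, which is why the text rounds to "about 34".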

Let’s visualize these patterns.

# Histogram of daily steps
ggplot(Clean_DailyActivity, aes(x = TotalSteps)) +
  geom_histogram(binwidth = 1000, fill = "#2E86AB", color = "white") +
  geom_vline(aes(xintercept = mean(TotalSteps)), 
             color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of Daily Steps",
       subtitle = paste("Average:", round(mean(Clean_DailyActivity$TotalSteps)), "steps"),
       x = "Total Steps",
       y = "Frequency") +
  theme_minimal()

The distribution reveals significant variability in daily activity. A notable spike at zero steps indicates days when users either didn’t wear their devices or were completely inactive. The majority of activity days cluster between 5,000-10,000 steps, with a small group of highly active days exceeding 20,000 steps. This suggests Bellabeat should focus on both consistency (reducing zero-step days) and motivation (helping users reach 10,000 steps more regularly).

# Histogram of calories
ggplot(Clean_DailyActivity, aes(x = Calories)) +
  geom_histogram(binwidth = 200, fill = "#A23B72", color = "white") +
  geom_vline(aes(xintercept = mean(Calories)), 
             color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of Daily Calories Burned",
       subtitle = paste("Average:", round(mean(Clean_DailyActivity$Calories)), "calories"),
       x = "Calories",
       y = "Frequency") +
  theme_minimal()

Calorie expenditure is closer to a normal distribution, centered around 2,000-2,500 calories per day, which is consistent with typical adult basal metabolic rates plus light activity. The distribution suggests most users maintain relatively consistent daily energy expenditure, even when step counts vary significantly. This indicates that factors beyond steps (such as body composition and non-tracked activities) play important roles in total calorie burn.

# Calculate total active vs sedentary time
ActivityBreakdown <- Clean_DailyActivity %>%
  summarise(
    Sedentary = mean(SedentaryMinutes),
    `Lightly Active` = mean(LightlyActiveMinutes),
    `Fairly Active` = mean(FairlyActiveMinutes),
    `Very Active` = mean(VeryActiveMinutes)
  ) %>%
  pivot_longer(everything(), names_to = "Activity_Type", values_to = "Minutes")
# Bar chart of average minutes per intensity level
ggplot(ActivityBreakdown, aes(x = reorder(Activity_Type, -Minutes), y = Minutes, fill = Activity_Type)) +
  geom_col() +
  geom_text(aes(label = round(Minutes, 0)), vjust = -0.5) +
  scale_fill_manual(values = c("Sedentary" = "#E63946", 
                                "Lightly Active" = "#F4A261",
                                "Fairly Active" = "#2A9D8F",
                                "Very Active" = "#264653")) +
  labs(title = "Average Daily Activity Minutes by Intensity",
       x = "Activity Level",
       y = "Minutes") +
  theme_minimal() +
  theme(legend.position = "none")

This visualization shows the most concerning finding of the analysis: users spend an average of 1,001 minutes (16.7 hours) per day sedentary, compared with about 34 minutes of moderate-to-vigorous physical activity. This roughly 30:1 ratio of sedentary time to meaningful exercise represents a significant health risk and a major opportunity for Bellabeat intervention. Users do log 188 minutes of light activity (likely walking and household tasks), but higher-intensity activity makes up only a small share of the day.

# Box plot for steps (shows outliers and distribution)
ggplot(Clean_DailyActivity, aes(y = TotalSteps)) +
  geom_boxplot(fill = "#2E86AB", alpha = 0.7) +
  labs(title = "Daily Steps Distribution with Outliers",
       y = "Total Steps") +
  theme_minimal() +
  theme(axis.text.x = element_blank())

The box plot confirms the wide variability in user behavior, with 50% of days falling between 3,300-10,600 steps. Numerous outliers above 20,000 steps suggest users are capable of high activity but don’t sustain it consistently. This indicates Bellabeat should focus on helping users maintain moderate consistency (7,000-10,000 steps daily) rather than encouraging occasional extreme activity days.
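The 20,000-step outlier threshold can be related to the usual 1.5 × IQR rule that geom_boxplot() applies by default. The sketch below works from the quartiles in the summary table; ggplot2's hinges can differ slightly from summary() quartiles, so the fence is approximate.

```r
# Quartiles of TotalSteps copied from the summary() output above
q1 <- 3321
q3 <- 10645

# Interquartile range: 7324
iqr <- q3 - q1
iqr

# Points beyond the upper fence are drawn as outliers: 21631
upper_fence <- q3 + 1.5 * iqr
upper_fence

# The lower fence is negative (-7665), so no day can be a low outlier
lower_fence <- q1 - 1.5 * iqr
lower_fence
```

This is why outliers appear only at the top of the plot: step counts cannot go below zero, while the upper fence sits at roughly 21,600 steps.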

User Classification (Sedentary vs. Active)

Summary 2

Create user summary

user_summary <- Clean_DailyActivity %>%
  group_by(Id) %>%
  summarise(
    avg_steps = mean(TotalSteps),
    avg_calories = mean(Calories),
    avg_active_minutes = mean(VeryActiveMinutes + FairlyActiveMinutes),
    avg_sedentary_minutes = mean(SedentaryMinutes),
    tracking_days = n()
  ) %>%
  ungroup()

# View the data
head(user_summary)
## # A tibble: 6 × 6
##           Id avg_steps avg_calories avg_active_minutes avg_sedentary_minutes
##        <dbl>     <dbl>        <dbl>              <dbl>                 <dbl>
## 1 1503960366    12175.        1845.              56.7                   850.
## 2 1624580081     5137.        1449.               9.67                 1279.
## 3 1644430081     7781.        2838.              37.8                  1130.
## 4 1844505072     2944.        1614.               1.48                 1176.
## 5 1927972279     1299.        2225.               2.02                 1241.
## 6 2022484408    11711.        2533.              57.9                  1111.
## # ℹ 1 more variable: tracking_days <int>
summary(user_summary)
##        Id              avg_steps        avg_calories  avg_active_minutes
##  Min.   :1.504e+09   Min.   :  773.6   Min.   :1449   Min.   :  0.2619  
##  1st Qu.:2.610e+09   1st Qu.: 4472.0   1st Qu.:1892   1st Qu.:  8.6855  
##  Median :4.445e+09   Median : 7363.0   Median :2192   Median : 26.2667  
##  Mean   :4.845e+09   Mean   : 7078.9   Mean   :2275   Mean   : 34.0384  
##  3rd Qu.:6.869e+09   3rd Qu.: 8671.9   3rd Qu.:2637   3rd Qu.: 54.0292  
##  Max.   :8.878e+09   Max.   :16759.4   Max.   :3488   Max.   :115.2439  
##  avg_sedentary_minutes tracking_days  
##  Min.   : 656.2        Min.   : 8.00  
##  1st Qu.: 781.1        1st Qu.:38.50  
##  Median :1099.9        Median :42.00  
##  Mean   :1012.4        Mean   :39.23  
##  3rd Qu.:1197.0        3rd Qu.:42.00  
##  Max.   :1369.3        Max.   :62.00
user_summary <- user_summary %>%
  mutate(
    user_type = case_when(
      avg_steps < 5000 ~ "Sedentary",
      avg_steps < 7500 ~ "Lightly Active",
      avg_steps < 10000 ~ "Fairly Active",
      TRUE ~ "Very Active"
    )
  )

# Show each category with the percentages
user_summary %>%
  count(user_type) %>%
  mutate(percentage = n / sum(n) * 100)
## # A tibble: 4 × 3
##   user_type          n percentage
##   <chr>          <int>      <dbl>
## 1 Fairly Active      8       22.9
## 2 Lightly Active     9       25.7
## 3 Sedentary         11       31.4
## 4 Very Active        7       20

Now let’s visualize this classification.

user_type_summary <- user_summary %>%
  count(user_type) %>%
  mutate(percentage = n / sum(n) * 100)

ggplot(user_type_summary, aes(x = reorder(user_type, -n), y = n, fill = user_type)) +
  geom_col() +
  geom_text(aes(label = paste0(n, " (", round(percentage, 1), "%)")), 
            vjust = -0.5, size = 4) +
  scale_fill_manual(values = c("Sedentary" = "#E63946", 
                                "Lightly Active" = "#F4A261", 
                                "Fairly Active" = "#2A9D8F", 
                                "Very Active" = "#264653")) +
  labs(title = "User Activity Level Distribution",
       subtitle = "Most users (57%) are sedentary or lightly active",
       x = "Activity Level",
       y = "Number of Users") +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 13)

Interpretation:

1. The majority need significant support (57%): The largest segment consists of sedentary (31.4%) and lightly active (25.7%) users, together representing 57% of the user base. These 20 users fall well short of the commonly cited goal of 10,000 daily steps and represent Bellabeat’s target audience for intervention.

2. Even distribution across activity levels: Rather than clustering in a single category, users are spread fairly evenly across all four levels (20% to 31% each). This suggests diverse fitness levels and motivations within the user base.

3. Only 1 in 5 users meets recommended goals: Just 7 users (20%) achieve the “very active” classification of 10,000+ steps daily. This reveals a significant opportunity: 80% of users need help reaching optimal activity levels.

4. The “Almost There” group: The 8 “Fairly Active” users (22.9%) averaging 7,500-10,000 steps represent a key opportunity: they are close to meeting goals and may respond well to targeted motivation to push them over the 10,000-step threshold.
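The four step thresholds behind this classification can also be expressed with base R's cut(), which returns an ordered factor so that tables and charts list the groups from least to most active instead of alphabetically. This is an alternative sketch, not the method used above; the step values are the first four user averages shown earlier, rounded.

```r
# Rounded avg_steps for the first four users in user_summary
avg_steps <- c(12175, 5137, 7781, 2944)

# Same cut-offs as the case_when() rules: <5000, <7500, <10000, otherwise
user_type <- cut(
  avg_steps,
  breaks = c(-Inf, 5000, 7500, 10000, Inf),
  labels = c("Sedentary", "Lightly Active", "Fairly Active", "Very Active"),
  right = FALSE,          # intervals are [lower, upper), matching the "<" comparisons
  ordered_result = TRUE   # keep the levels in activity order
)

user_type
table(user_type)
```

With an ordered factor, the count table and any bar chart built from it follow the activity scale without a manual reorder step.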

Activity Relationship

Step vs. Calories Correlation

# Scatter plot: Steps vs. Calories
ggplot(Clean_DailyActivity, aes(x = TotalSteps, y = Calories)) +
  geom_point(alpha = 0.5, color = "#2E86AB") +
  geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
  labs(title = "Relationship Between Daily Steps and Calories Burned",
       subtitle = paste("Correlation:", 
                       round(cor(Clean_DailyActivity$TotalSteps, Clean_DailyActivity$Calories), 3)),
       x = "Total Steps",
       y = "Calories Burned") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Calculate correlation
cor(Clean_DailyActivity$TotalSteps, Clean_DailyActivity$Calories)
## [1] 0.5800765

This visualization shows a moderately strong positive correlation between daily steps and calories burned (r = 0.58). As the following sections show, this is slightly weaker than the correlations for very active minutes and distance.

Active Minutes vs. Calories

# Create total active minutes variable
Clean_DailyActivity <- Clean_DailyActivity %>%
  mutate(TotalActiveMinutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes)

# Scatter plot: Active Minutes vs. Calories
ggplot(Clean_DailyActivity, aes(x = TotalActiveMinutes, y = Calories)) +
  geom_point(alpha = 0.5, color = "#2A9D8F") +
  geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
  labs(title = "Relationship Between Active Minutes and Calories Burned",
       subtitle = paste("Correlation:", 
                       round(cor(Clean_DailyActivity$TotalActiveMinutes, Clean_DailyActivity$Calories), 3)),
       x = "Total Active Minutes",
       y = "Calories Burned") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Calculate correlation
cor(Clean_DailyActivity$TotalActiveMinutes, Clean_DailyActivity$Calories)
## [1] 0.4689218

Next, this scatter plot shows a moderate correlation between active minutes and calories burned (r = 0.469). Surprisingly, total active minutes (lightly, fairly, and very active combined) shows the weakest correlation of the four measures compared here.

Very Active Minutes vs. Calories

# Scatter plot: Very Active Minutes vs. Calories
ggplot(Clean_DailyActivity, aes(x = VeryActiveMinutes, y = Calories)) +
  geom_point(alpha = 0.5, color = "#F4A261") +
  geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
  labs(title = "Impact of High-Intensity Activity on Calories Burned",
       subtitle = paste("Correlation:", 
                       round(cor(Clean_DailyActivity$VeryActiveMinutes, Clean_DailyActivity$Calories), 3)),
       x = "Very Active Minutes",
       y = "Calories Burned") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Calculate correlation
cor(Clean_DailyActivity$VeryActiveMinutes, Clean_DailyActivity$Calories)
## [1] 0.593829

For high-intensity activity, the scatter plot shows a moderately strong correlation between very active minutes and calories burned (r = 0.594). This is the second-strongest correlation with calories, even though users average only about 20 minutes per day at this intensity.

This reveals an important insight: quality matters as much as quantity.

Distance vs. Calories

# Scatter plot: Distance vs. Calories
ggplot(Clean_DailyActivity, aes(x = TotalDistance, y = Calories)) +
  geom_point(alpha = 0.5, color = "#264653") +
  geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
  labs(title = "Relationship Between Distance Traveled and Calories Burned",
       subtitle = paste("Correlation:", 
                       round(cor(Clean_DailyActivity$TotalDistance, Clean_DailyActivity$Calories), 3)),
       x = "Total Distance (km)",
       y = "Calories Burned") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Calculate correlation
cor(Clean_DailyActivity$TotalDistance, Clean_DailyActivity$Calories)
## [1] 0.6295828

Last, distance and calories burned show the strongest correlation of the four (r = 0.63). This makes intuitive sense: covering more ground requires more energy expenditure regardless of pace.
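Since the four comparisons all use Calories as the outcome, the coefficients can be computed and ranked in a single step instead of one cor() call per plot. The sketch below uses a small hypothetical data frame, so its numbers are illustrative and will not match the values reported above.

```r
# Hypothetical daily rows standing in for Clean_DailyActivity
activity <- data.frame(
  TotalSteps        = c(0, 3000, 7000, 10000, 15000),
  TotalDistance     = c(0, 2.1, 5.0, 7.2, 11.0),
  VeryActiveMinutes = c(0, 0, 5, 30, 60),
  Calories          = c(1500, 1800, 2100, 2600, 3200)
)

predictors <- c("TotalSteps", "TotalDistance", "VeryActiveMinutes")

# Pearson correlation of each predictor with Calories
r_values <- sapply(activity[predictors], function(x) cor(x, activity$Calories))

# Rank predictors from strongest to weakest
sort(r_values, decreasing = TRUE)
```

One caveat on interpretation: the daily rows are repeated measurements of the same users, so these coefficients describe day-level association and should not be read as independent observations.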

Time-Based Pattern

Day of Week Analysis - Are users more active on weekdays or weekends?

# Add day of week column
Clean_DailyActivity <- Clean_DailyActivity %>%
  mutate(
    day_of_week = wday(ActivityDate, label = TRUE),  # Mon, Tue, Wed...
    day_type = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday")
  )

# Check it worked
head(Clean_DailyActivity %>% select(ActivityDate, day_of_week, day_type))
## # A tibble: 6 × 3
##   ActivityDate day_of_week day_type
##   <date>       <ord>       <chr>   
## 1 2016-03-25   Fri         Weekday 
## 2 2016-03-26   Sat         Weekend 
## 3 2016-03-27   Sun         Weekend 
## 4 2016-03-28   Mon         Weekday 
## 5 2016-03-29   Tue         Weekday 
## 6 2016-03-30   Wed         Weekday
# Summary stats by day type
Clean_DailyActivity %>%
  group_by(day_type) %>%
  summarise(
    avg_steps = mean(TotalSteps),
    avg_calories = mean(Calories),
    avg_distance = mean(TotalDistance),
    avg_active_minutes = mean(VeryActiveMinutes + FairlyActiveMinutes),
    avg_sedentary = mean(SedentaryMinutes)
  )
## # A tibble: 2 × 6
##   day_type avg_steps avg_calories avg_distance avg_active_minutes avg_sedentary
##   <chr>        <dbl>        <dbl>        <dbl>              <dbl>         <dbl>
## 1 Weekday      7453.        2302.         5.34               33.7         1008.
## 2 Weekend      7188.        2277.         5.16               32.9          984.

These results show no substantial difference between weekdays and weekends: the averages differ by only about 265 steps per day (7,453 vs. 7,188).
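Whether a gap of this size is statistically distinguishable from zero can be checked with a two-sample test. The sketch below is an illustration on invented data (`toy_days` is hypothetical). It also builds the weekday/weekend flag from the numeric day of the week, which does not depend on locale-specific labels such as "Sat" and "Sun". Days from the same user are not independent, so the p-value is only a rough guide.

```r
set.seed(42)  # reproducible toy data

# Toy daily records (invented values, not the Fitbit data)
toy_days <- data.frame(
  ActivityDate = seq(as.Date("2016-04-04"), by = "day", length.out = 28),
  TotalSteps   = round(rnorm(28, mean = 7400, sd = 2500))
)

# as.POSIXlt()$wday is 0 for Sunday ... 6 for Saturday, regardless of locale
wday_num <- as.POSIXlt(toy_days$ActivityDate)$wday
toy_days$day_type <- ifelse(wday_num %in% c(0, 6), "Weekend", "Weekday")

# Welch two-sample t-test: do mean steps differ between day types?
step_test <- t.test(TotalSteps ~ day_type, data = toy_days)
step_test$p.value
```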

# Calculate average by day of week
daily_summary <- Clean_DailyActivity %>%
  group_by(day_of_week) %>%
  summarise(
    avg_steps = mean(TotalSteps),
    avg_calories = mean(Calories),
    count = n()
  )

# Line chart showing weekly pattern
ggplot(daily_summary, aes(x = day_of_week, y = avg_steps, group = 1)) +
  geom_line(color = "#2E86AB", linewidth = 1.2) +
  geom_point(color = "#2E86AB", size = 3) +
  labs(title = "Average Daily Steps by Day of Week",
       subtitle = "Do users maintain consistency throughout the week?",
       x = "Day of Week",
       y = "Average Steps") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0))

# Box plot comparing weekday vs weekend
ggplot(Clean_DailyActivity, aes(x = day_type, y = TotalSteps, fill = day_type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Weekday" = "#2E86AB", "Weekend" = "#E63946")) +
  labs(title = "Activity Levels: Weekdays vs Weekends",
       x = "Day Type",
       y = "Total Steps") +
  theme_minimal() +
  theme(legend.position = "none")

# Bar chart with multiple metrics
Clean_DailyActivity %>%
  group_by(day_type) %>%
  summarise(
    Steps = mean(TotalSteps),
    Calories = mean(Calories),
    `Active Minutes` = mean(VeryActiveMinutes + FairlyActiveMinutes)
  ) %>%
  pivot_longer(cols = -day_type, names_to = "Metric", values_to = "Value") %>%
  ggplot(aes(x = day_type, y = Value, fill = day_type)) +
  geom_col() +
  facet_wrap(~Metric, scales = "free_y") +
  scale_fill_manual(values = c("Weekday" = "#2E86AB", "Weekend" = "#E63946")) +
  labs(title = "Weekday vs Weekend Activity Comparison",
       x = "",
       y = "Average Value") +
  theme_minimal() +
  theme(legend.position = "none")

Hourly Patterns - When during the day are users most active?

# Step 1: Convert ActivityHour to datetime and extract hour
HourlySteps <- HourlySteps %>%
  mutate(
    ActivityHour = mdy_hms(ActivityHour),  # Convert to datetime
    hour = hour(ActivityHour)               # Extract just the hour (0-23)
  )

# Step 2: NOW create hourly pattern grouped by hour only
hourly_pattern <- HourlySteps %>%
  group_by(hour) %>%  # Group by hour (0-23), not full datetime
  summarise(
    avg_steps = mean(StepTotal),
    median_steps = median(StepTotal),
    total_records = n()
  ) %>%
  arrange(hour)

# View the results
head(hourly_pattern)
## # A tibble: 6 × 4
##    hour avg_steps median_steps total_records
##   <int>     <dbl>        <dbl>         <int>
## 1     0     43.4             0          1955
## 2     1     21.7             0          1954
## 3     2     13.7             0          1954
## 4     3      6.89            0          1952
## 5     4     11.2             0          1950
## 6     5     34.6             0          1949
ggplot(hourly_pattern, aes(x = hour, y = avg_steps)) +
  geom_line(color = "#2E86AB", linewidth = 1.2) +
  geom_point(color = "#2E86AB", size = 3) +
  geom_area(alpha = 0.3, fill = "#2E86AB") +
  scale_x_continuous(breaks = seq(0, 23, 2),
                     labels = c("12AM", "2AM", "4AM", "6AM", "8AM", "10AM",
                               "12PM", "2PM", "4PM", "6PM", "8PM", "10PM")) +
  labs(title = "Average Hourly Step Count Throughout the Day",
       subtitle = "Peak activity occurs between 5-7 PM (evening)",
       x = "Hour of Day",
       y = "Average Steps per Hour") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation:

The hourly activity pattern reveals distinct behavioral trends throughout the day:

1. Sleep Period (12 AM - 5 AM): Minimal activity (~7-45 steps/hour) - Most users are likely asleep during these hours - Baseline activity from restless movements

2. Morning Ramp-Up (6 AM - 8 AM): Sharp increase (150 → 400 steps/hour) - Morning routines and commutes begin - Activity more than doubles within 2 hours

3. Mid-Morning Plateau (9 AM - 11 AM): Sustained moderate activity (~450-500 steps/hour) - Likely reflects desk work with occasional movement - Office workers taking short walks

4. Lunch Peak (12 PM - 1 PM): First daily peak (~550 steps/hour) - Lunch break activity - Walking to restaurants or outdoor breaks

5. Afternoon Dip (3 PM): Notable decrease (~400 steps/hour) - Post-lunch slump - “Dead zone” for activity

6. Evening Peak (5 PM - 7 PM): Highest activity of the day (600+ steps/hour) - 6 PM shows maximum activity - Post-work exercise, evening walks - Commute home + intentional fitness

7. Evening Wind-Down (8 PM - 11 PM): Gradual decline (380 → 200 steps/hour) - Dinner, relaxation, home activities - Preparing for sleep
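The peak hours described above can also be read off the summary table directly instead of from the chart. A minimal sketch, assuming a table shaped like hourly_pattern; `toy_hourly` and its values are invented for illustration.

```r
# Toy hourly averages (invented), one row per hour 0-23
toy_hourly <- data.frame(
  hour = 0:23,
  avg_steps = c(40, 20, 15, 7, 10, 35, 150, 300, 400, 450, 480, 500,
                550, 540, 520, 400, 480, 560, 600, 580, 380, 300, 250, 200)
)

# Hour with the highest average step count
peak_hour <- toy_hourly$hour[which.max(toy_hourly$avg_steps)]

# Three busiest hours, highest first
top_hours <- head(toy_hourly[order(-toy_hourly$avg_steps), ], 3)

peak_hour
top_hours
```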

Sleep Analysis

Summary 3

# Overall sleep summary
Clean_DailySleep %>%
  mutate(sleep_hours = TotalMinutesAsleep / 60) %>%
  select(TotalMinutesAsleep, sleep_hours, TotalTimeInBed) %>%
  summary()
##  TotalMinutesAsleep  sleep_hours      TotalTimeInBed 
##  Min.   : 58.0      Min.   : 0.9667   Min.   : 61.0  
##  1st Qu.:361.0      1st Qu.: 6.0167   1st Qu.:403.8  
##  Median :432.5      Median : 7.2083   Median :463.0  
##  Mean   :419.2      Mean   : 6.9862   Mean   :458.5  
##  3rd Qu.:490.0      3rd Qu.: 8.1667   3rd Qu.:526.0  
##  Max.   :796.0      Max.   :13.2667   Max.   :961.0
# More detailed summary
Clean_DailySleep %>%
  mutate(
    sleep_hours = TotalMinutesAsleep / 60,
    time_in_bed_hours = TotalTimeInBed / 60,
    sleep_efficiency = (TotalMinutesAsleep / TotalTimeInBed) * 100
  ) %>%
  summarise(
    avg_sleep_hours = mean(sleep_hours),
    median_sleep_hours = median(sleep_hours),
    min_sleep_hours = min(sleep_hours),
    max_sleep_hours = max(sleep_hours),
    avg_time_in_bed = mean(time_in_bed_hours),
    avg_sleep_efficiency = mean(sleep_efficiency),
    total_sleep_records = n(),
    unique_users = n_distinct(Id)
  )
## # A tibble: 1 × 8
##   avg_sleep_hours median_sleep_hours min_sleep_hours max_sleep_hours
##             <dbl>              <dbl>           <dbl>           <dbl>
## 1            6.99               7.21           0.967            13.3
## # ℹ 4 more variables: avg_time_in_bed <dbl>, avg_sleep_efficiency <dbl>,
## #   total_sleep_records <int>, unique_users <int>

Sleep Analysis Key Insights

Average Sleep: 6.99 hours per night (just below the recommended minimum of 7 hours). Users are almost getting enough sleep, but not quite.
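The 6.99-hour figure averages over every nightly record, so users who logged many nights weigh more than users who logged few. Averaging per user first gives each user equal weight and can produce a different number. The sketch below shows the difference on invented data (`toy_sleep` is hypothetical).

```r
# Toy nightly records (invented): user A logs four nights, user B only one
toy_sleep <- data.frame(
  Id          = c("A", "A", "A", "A", "B"),
  sleep_hours = c(7, 7.5, 8, 7.5, 1)
)

# Pooled mean: every night counts equally
pooled_mean <- mean(toy_sleep$sleep_hours)

# Mean of per-user means: every user counts equally
per_user        <- tapply(toy_sleep$sleep_hours, toy_sleep$Id, mean)
user_level_mean <- mean(per_user)

c(pooled = pooled_mean, user_level = user_level_mean)
```

Which of the two is appropriate depends on whether the claim is about a typical night or a typical user.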

# Calculate average sleep per user
user_sleep_summary <- Clean_DailySleep %>%
  mutate(
    sleep_hours = TotalMinutesAsleep / 60,
    sleep_efficiency = (TotalMinutesAsleep / TotalTimeInBed) * 100
  ) %>%
  group_by(Id) %>%
  summarise(
    avg_sleep_hours = mean(sleep_hours),
    avg_time_in_bed = mean(TotalTimeInBed) / 60,
    avg_sleep_efficiency = mean(sleep_efficiency),
    sleep_records = n()
  ) %>%
  ungroup()

# View the results
head(user_sleep_summary)
## # A tibble: 6 × 5
##           Id avg_sleep_hours avg_time_in_bed avg_sleep_efficiency sleep_records
##        <dbl>           <dbl>           <dbl>                <dbl>         <int>
## 1 1503960366            6.00            6.39                 93.6            25
## 2 1644430081            4.9             5.77                 88.2             4
## 3 1844505072           10.9            16.0                  67.8             3
## 4 1927972279            6.95            7.30                 94.7             5
## 5 2026352035            8.44            8.96                 94.1            28
## 6 2320127002            1.02            1.15                 88.4             1
summary(user_sleep_summary)
##        Id            avg_sleep_hours  avg_time_in_bed  avg_sleep_efficiency
##  Min.   :1.504e+09   Min.   : 1.017   Min.   : 1.150   Min.   :63.37       
##  1st Qu.:2.340e+09   1st Qu.: 5.605   1st Qu.: 6.284   1st Qu.:91.40       
##  Median :4.502e+09   Median : 6.954   Median : 7.434   Median :93.97       
##  Mean   :4.764e+09   Mean   : 6.291   Mean   : 6.999   Mean   :91.30       
##  3rd Qu.:6.822e+09   3rd Qu.: 7.488   3rd Qu.: 8.121   3rd Qu.:94.87       
##  Max.   :8.792e+09   Max.   :10.867   Max.   :16.017   Max.   :98.49       
##  sleep_records  
##  Min.   : 1.00  
##  1st Qu.: 4.75  
##  Median :20.50  
##  Mean   :17.08  
##  3rd Qu.:27.25  
##  Max.   :31.00

Sleep Efficiency: 91.3% on average per user (EXCELLENT!)

- Above the 85% threshold for good sleep quality

- Suggests users fall asleep quickly and stay asleep

- Median of 94% means at least half of users have very efficient sleep

- Range: 63%-98%, with the middle half of users between 91.4% and 94.9%, so most users sleep well once in bed
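As a worked example of the efficiency formula and the 85% threshold, the sketch below computes per-night efficiency and the share of nights that clear the threshold. The three nights in `toy_nights` are invented for illustration.

```r
# Toy nights for one user (invented values, in minutes)
toy_nights <- data.frame(
  TotalMinutesAsleep = c(420, 300, 450),
  TotalTimeInBed     = c(450, 400, 470)
)

# Per-night efficiency, as a percentage
efficiency <- toy_nights$TotalMinutesAsleep / toy_nights$TotalTimeInBed * 100

# Share of nights at or above the 85% threshold
share_above_85 <- mean(efficiency >= 85) * 100

round(efficiency, 1)
share_above_85
```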

Let’s visualize this information.

# Sleep efficiency histogram
Clean_DailySleep %>%
  mutate(sleep_efficiency = (TotalMinutesAsleep / TotalTimeInBed) * 100) %>%
  ggplot(aes(x = sleep_efficiency)) +
  geom_histogram(binwidth = 2, fill = "#1982C4", color = "white") +
  geom_vline(aes(xintercept = 85), color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text", x = 80, y = 40, label = "85% efficiency\nthreshold", color = "red") +
  labs(title = "Sleep Efficiency Distribution",
       subtitle = "Sleep efficiency = (Time Asleep / Time in Bed) × 100%",
       x = "Sleep Efficiency (%)",
       y = "Frequency") +
  theme_minimal()

# Classify users by sleep duration
user_sleep_summary <- user_sleep_summary %>%
  mutate(
    sleep_category = case_when(
      avg_sleep_hours < 6 ~ "Insufficient Sleep (<6h)",
      avg_sleep_hours < 7 ~ "Below Recommended (6-7h)",
      avg_sleep_hours <= 9 ~ "Recommended (7-9h)",
      TRUE ~ "Excessive Sleep (>9h)"
    )
  )

# See distribution
table(user_sleep_summary$sleep_category)
## 
## Below Recommended (6-7h)    Excessive Sleep (>9h) Insufficient Sleep (<6h) 
##                        5                        1                        8 
##       Recommended (7-9h) 
##                       10
# With percentages
user_sleep_summary %>%
  count(sleep_category) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  arrange(desc(n))
## # A tibble: 4 × 3
##   sleep_category               n percentage
##   <chr>                    <int>      <dbl>
## 1 Recommended (7-9h)          10      41.7 
## 2 Insufficient Sleep (<6h)     8      33.3 
## 3 Below Recommended (6-7h)     5      20.8 
## 4 Excessive Sleep (>9h)        1       4.17

Let’s create a visualization based on this.

# Bar chart of sleep categories
user_sleep_summary %>%
  count(sleep_category) %>%
  mutate(
    percentage = n / sum(n) * 100,
    sleep_category = factor(sleep_category, 
                           levels = c("Insufficient Sleep (<6h)",
                                    "Below Recommended (6-7h)",
                                    "Recommended (7-9h)",
                                    "Excessive Sleep (>9h)"))
  ) %>%
  ggplot(aes(x = sleep_category, y = n, fill = sleep_category)) +
  geom_col() +
  geom_text(aes(label = paste0(n, " users\n(", round(percentage, 1), "%)")), 
            vjust = -0.3, size = 4) +
  scale_fill_manual(values = c("Insufficient Sleep (<6h)" = "#E63946",
                                "Below Recommended (6-7h)" = "#F4A261",
                                "Recommended (7-9h)" = "#2A9D8F",
                                "Excessive Sleep (>9h)" = "#264653")) +
  labs(title = "Sleep Quality Classification",
       subtitle = "Based on CDC recommendations (7-9 hours for adults)",
       x = "",
       y = "Number of Users") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1)) +
  ylim(0, max(table(user_sleep_summary$sleep_category)) + 3)

Interpretation:

- The largest group, with 10 users (41.7%), sleeps 7-9 hours on average. These individuals meet standard health guidelines for sleep duration.

- ‘Insufficient Sleep (<6h)’ is the second-largest group, with 8 users (33.3%). These users are significantly sleep-deprived.

- ‘Below Recommended (6-7h)’ has 5 users (20.8%). These users are slightly under the target and are often referred to as “short sleepers.”

- ‘Excessive Sleep (>9h)’ is a very small minority: only 1 user (4.2%) sleeps longer than the typical clinical recommendation.
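Some users in the per-user summary have only a handful of sleep records (as few as one), so their averages may not reflect typical sleep. One possible sensitivity check is to repeat the classification using only users with a minimum number of nights. The cutoff `min_nights` and the data in `toy_users` are hypothetical, chosen only for illustration.

```r
# Toy per-user summary (invented values)
toy_users <- data.frame(
  Id              = c("A", "B", "C", "D"),
  avg_sleep_hours = c(7.4, 1.0, 6.5, 10.9),
  sleep_records   = c(25, 1, 14, 3)
)

min_nights <- 7  # hypothetical cutoff, not taken from the analysis

# Keep users with enough nights for a more stable average
reliable_users <- toy_users[toy_users$sleep_records >= min_nights, ]

reliable_users$Id
```

If the category shares change noticeably after filtering, the extreme groups are probably driven by sparse data.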

# Histogram of sleep hours
Clean_DailySleep %>%
  mutate(sleep_hours = TotalMinutesAsleep / 60) %>%
  ggplot(aes(x = sleep_hours)) +
  geom_histogram(binwidth = 0.5, fill = "#6A4C93", color = "white") +
  geom_vline(aes(xintercept = 7), color = "green", linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = 9), color = "green", linetype = "dashed", linewidth = 1) +
  annotate("text", x = 8, y = 50, label = "Recommended\n7-9 hours", color = "green") +
  labs(title = "Distribution of Sleep Duration",
       subtitle = "Green lines indicate CDC recommended range (7-9 hours)",
       x = "Hours of Sleep",
       y = "Frequency") +
  theme_minimal()

Sleep + Activity Relationship

The next analysis examines the relationship between sleep and overall activity. Because SleepDay in Clean_DailySleep is still POSIXct, I need to convert it to Date before merging it with the Daily Activity dataframe.

# Convert SleepDay from POSIXct to Date
Clean_DailySleep <- Clean_DailySleep %>%
  mutate(SleepDay = as.Date(SleepDay))

# Verify the conversion
class(Clean_DailySleep$SleepDay)
## [1] "Date"
head(Clean_DailySleep$SleepDay)
## [1] "2016-04-12" "2016-04-13" "2016-04-15" "2016-04-16" "2016-04-17"
## [6] "2016-04-19"

Now SleepDay is a Date, and the data is ready to merge.

# Merge Daily Activity and Daily Sleep dataframe
sleep_activity <- Clean_DailyActivity %>%
  inner_join(Clean_DailySleep, by = c("Id" = "Id", "ActivityDate" = "SleepDay"))

# View it
head(sleep_activity)
## # A tibble: 6 × 22
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
##        <dbl> <date>            <dbl>         <dbl>           <dbl>
## 1 1503960366 2016-04-12        13162          8.5             8.5 
## 2 1503960366 2016-04-13        10735          6.97            6.97
## 3 1503960366 2016-04-15         9762          6.28            6.28
## 4 1503960366 2016-04-16        12669          8.16            8.16
## 5 1503960366 2016-04-17         9705          6.48            6.48
## 6 1503960366 2016-04-19        15506          9.88            9.88
## # ℹ 17 more variables: LoggedActivitiesDistance <dbl>,
## #   VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## #   LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## #   VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## #   TotalActiveMinutes <dbl>, day_of_week <ord>, day_type <chr>,
## #   TotalSleepRecords <dbl>, TotalMinutesAsleep <dbl>, TotalTimeInBed <dbl>, …
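A join on Id and date multiplies rows if either table holds more than one record for the same user and day. Clean_DailySleep may already be de-duplicated; as a precaution, the keys can be checked before merging. The sketch below uses invented data (`toy_sleep_days` is hypothetical).

```r
# Toy sleep records (invented); user 1 has a repeated night
toy_sleep_days <- data.frame(
  Id       = c(1, 1, 1, 2),
  SleepDay = as.Date(c("2016-04-12", "2016-04-13", "2016-04-13", "2016-04-12"))
)

# TRUE for rows whose (Id, SleepDay) pair already appeared earlier
is_dup <- duplicated(toy_sleep_days[, c("Id", "SleepDay")])
sum(is_dup)

# Drop repeated keys so a join cannot multiply rows
toy_sleep_unique <- toy_sleep_days[!is_dup, ]
nrow(toy_sleep_unique)
```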

Summary 4

# Add sleep-related variables
sleep_activity <- sleep_activity %>%
  mutate(
    sleep_hours = TotalMinutesAsleep / 60,
    sleep_efficiency = (TotalMinutesAsleep / TotalTimeInBed) * 100,
    total_active_minutes = VeryActiveMinutes + FairlyActiveMinutes + LightlyActiveMinutes
  )

# Summary
summary(sleep_activity %>% select(sleep_hours, TotalSteps, Calories, total_active_minutes))
##   sleep_hours        TotalSteps       Calories    total_active_minutes
##  Min.   : 0.9667   Min.   :   17   Min.   : 257   Min.   :  2.0       
##  1st Qu.: 6.0167   1st Qu.: 5189   1st Qu.:1841   1st Qu.:206.5       
##  Median : 7.2083   Median : 8913   Median :2207   Median :263.5       
##  Mean   : 6.9862   Mean   : 8515   Mean   :2389   Mean   :259.5       
##  3rd Qu.: 8.1667   3rd Qu.:11370   3rd Qu.:2920   3rd Qu.:315.5       
##  Max.   :13.2667   Max.   :22770   Max.   :4900   Max.   :540.0

Next, the relationships I want to examine are:

1. Is there any meaningful correlation between Sleep Hours vs. Steps?

2. Is there any meaningful correlation between Sleep Hours vs. Calories? Does sleeping more burn more calories?

3. Is there any meaningful correlation between Sleep Hours vs. Active Minutes?

4. Is there any meaningful correlation between Sleep Efficiency vs. Steps?

# Key correlations
cor(sleep_activity$sleep_hours, sleep_activity$TotalSteps)
## [1] -0.1903439
cor(sleep_activity$sleep_hours, sleep_activity$Calories)
## [1] -0.03169899
cor(sleep_activity$sleep_hours, sleep_activity$total_active_minutes)
## [1] -0.06929398
cor(sleep_activity$sleep_efficiency, sleep_activity$TotalSteps)
## [1] -0.1100255
# Create a correlation summary
sleep_correlations <- data.frame(
  Metric = c("Sleep Hours vs Steps", 
             "Sleep Hours vs Calories",
             "Sleep Hours vs Active Minutes",
             "Sleep Efficiency vs Steps"),
  Correlation = c(
    cor(sleep_activity$sleep_hours, sleep_activity$TotalSteps),
    cor(sleep_activity$sleep_hours, sleep_activity$Calories),
    cor(sleep_activity$sleep_hours, sleep_activity$total_active_minutes),
    cor(sleep_activity$sleep_efficiency, sleep_activity$TotalSteps)
  )
)

print(sleep_correlations)
##                          Metric Correlation
## 1          Sleep Hours vs Steps -0.19034392
## 2       Sleep Hours vs Calories -0.03169899
## 3 Sleep Hours vs Active Minutes -0.06929398
## 4     Sleep Efficiency vs Steps -0.11002554

Before interpreting these results, let’s create a visualization.

# Visualization 1: Sleep vs Steps
ggplot(sleep_activity, aes(x = sleep_hours, y = TotalSteps)) +
  geom_point(alpha = 0.5, color = "#6A4C93") +
  geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
  geom_vline(xintercept = 7, linetype = "dashed", color = "green", alpha = 0.5) +
  geom_vline(xintercept = 9, linetype = "dashed", color = "green", alpha = 0.5) +
  annotate("text", x = 8, y = max(sleep_activity$TotalSteps) * 0.9, 
           label = "Recommended\n7-9 hours", color = "green", size = 3) +
  labs(title = "Sleep Duration vs Daily Steps: No Positive Relationship",
       subtitle = "Correlation: -0.19 (weak negative)",
       x = "Hours of Sleep",
       y = "Total Steps") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The Surprising Finding: No Positive Correlation

Contrary to conventional wisdom that “better sleep leads to more activity,” this analysis reveals no positive relationship between sleep duration and physical activity levels.

Correlation Analysis:

- Sleep Hours vs Steps: -0.19 (weak negative)

- Sleep Hours vs Calories: -0.03 (essentially zero)

- Sleep Hours vs Active Minutes: -0.07 (essentially zero)

- Sleep Efficiency vs Steps: -0.11 (weak negative)

The scatter plot shows a slight downward trend: days with more sleep tend to have slightly fewer steps, though the relationship is weak and highly variable.
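A coefficient on its own does not say how precisely it is estimated. Base R's cor.test() reports a confidence interval and p-value alongside it. The sketch below runs on invented data (`toy` is hypothetical, with steps generated independently of sleep); with the real data, the columns of sleep_activity would be passed instead. Records from the same user are not independent, so the interval is likely too narrow.

```r
set.seed(1)  # reproducible toy data

# Toy records (invented): steps are generated independently of sleep
toy <- data.frame(
  sleep_hours = runif(100, min = 4, max = 10),
  TotalSteps  = round(rnorm(100, mean = 8500, sd = 3000))
)

# Pearson correlation with a 95% confidence interval and p-value
sleep_test <- cor.test(toy$sleep_hours, toy$TotalSteps)

sleep_test$estimate   # correlation coefficient
sleep_test$conf.int   # 95% confidence interval
sleep_test$p.value
```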

Next, let’s look at the relationship between daily steps and sleep quality (classified here by sleep duration).

# Average steps by sleep category
sleep_activity <- sleep_activity %>%
  mutate(
    sleep_category = case_when(
      sleep_hours < 6 ~ "Insufficient (<6h)",
      sleep_hours < 7 ~ "Below Rec. (6-7h)",
      sleep_hours <= 9 ~ "Recommended (7-9h)",
      TRUE ~ "Excessive (>9h)"
    ),
    sleep_category = factor(sleep_category,
                           levels = c("Insufficient (<6h)",
                                    "Below Rec. (6-7h)",
                                    "Recommended (7-9h)",
                                    "Excessive (>9h)"))
  )

# Bar chart
sleep_activity %>%
  group_by(sleep_category) %>%
  summarise(
    avg_steps = mean(TotalSteps),
    count = n()
  ) %>%
  ggplot(aes(x = sleep_category, y = avg_steps, fill = sleep_category)) +
  geom_col() +
  geom_text(aes(label = paste0(round(avg_steps, 0), " steps\n(n=", count, ")")),
            vjust = -0.3, size = 3.5) +
  scale_fill_manual(values = c("Insufficient (<6h)" = "#E63946",
                                "Below Rec. (6-7h)" = "#F4A261",
                                "Recommended (7-9h)" = "#2A9D8F",
                                "Excessive (>9h)" = "#264653")) +
  labs(title = "Average Daily Steps by Sleep Quality",
       subtitle = "Well-rested users don't necessarily take more steps",
       x = "",
       y = "Average Steps") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1)) +
  ylim(0, max(tapply(sleep_activity$TotalSteps, sleep_activity$sleep_category, mean)) * 1.15)

Device Usage & Tracking Consistency

Understanding how consistently users engage with their fitness trackers reveals critical insights about user behavior and product stickiness.

Activity Tracking

# How many days did each user track activity?
user_tracking <- Clean_DailyActivity %>%
  group_by(Id) %>%
  summarise(
    tracking_days = n(),
    avg_steps = mean(TotalSteps)
  ) %>%
  arrange(desc(tracking_days))

# Summary statistics
summary(user_tracking$tracking_days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   38.50   42.00   39.23   42.00   62.00
# View the data
head(user_tracking)
## # A tibble: 6 × 3
##           Id tracking_days avg_steps
##        <dbl>         <int>     <dbl>
## 1 4020332650            62     4115.
## 2 1503960366            49    12175.
## 3 1624580081            49     5137.
## 4 4445114986            45     4729.
## 5 4702921684            45     8553 
## 6 6962181067            44    10789.

Let’s create a visualization based on this summary

# Viz1: Activity Tracking Distribution
ggplot(user_tracking, aes(x = tracking_days)) +
  geom_histogram(binwidth = 5, fill = "#2E86AB", color = "white") +
  geom_vline(aes(xintercept = median(tracking_days)), 
             color = "red", linetype = "dashed", linewidth = 1) +
  annotate("text", x = median(user_tracking$tracking_days) + 5, y = 8,
           label = paste("Median:", median(user_tracking$tracking_days), "days"),
           color = "red") +
  labs(title = "User Engagement: Days of Activity Tracking",
       subtitle = "How consistently do users wear their devices?",
       x = "Number of Days Tracked",
       y = "Number of Users") +
  theme_minimal()

Activity Tracking: Strong Engagement

Finding: Users demonstrate solid commitment to activity tracking, with a median of 42 days tracked over the study period.

Key Statistics:

- Median: 42 days

- Average: 39.2 days

- Range: 8-62 days

- Most users (20 out of 35) tracked for 38-42+ days

Interpretation: The concentration of users around the 40-day mark suggests good device adoption and habit formation. Users who start tracking tend to stick with it for at least a month, indicating the value proposition for activity tracking is clear and compelling.

Sleep Tracking

# How many days did each user track sleep?
sleep_tracking <- Clean_DailySleep %>%
  group_by(Id) %>%
  summarise(sleep_days = n()) %>%
  arrange(desc(sleep_days))

# Summary statistics
summary(sleep_tracking$sleep_days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.75   20.50   17.08   27.25   31.00
# View it
head(sleep_tracking)
## # A tibble: 6 × 2
##           Id sleep_days
##        <dbl>      <int>
## 1 5553957443         31
## 2 6962181067         31
## 3 8378563200         31
## 4 2026352035         28
## 5 3977333714         28
## 6 4445114986         28
# Merge tracking data
tracking_comparison <- user_tracking %>%
  left_join(sleep_tracking, by = "Id") %>%
  mutate(sleep_days = ifelse(is.na(sleep_days), 0, sleep_days))

# Summary
tracking_comparison %>%
  summarise(
    avg_activity_days = mean(tracking_days),
    avg_sleep_days = mean(sleep_days, na.rm = TRUE),
    users_tracking_sleep = sum(sleep_days > 0)
  )
## # A tibble: 1 × 3
##   avg_activity_days avg_sleep_days users_tracking_sleep
##               <dbl>          <dbl>                <int>
## 1              39.2           11.7                   24

Sleep Tracking: The Engagement Gap

Finding: Sleep tracking shows much lower engagement, with a median of only 20.5 days, less than half the median for activity tracking (42 days).

Key Statistics:

- Median: 20.5 days (among those who tracked)

- Average across all users: 11.7 days

- 31% of users (11 out of 35) never tracked sleep at all

- Among users who tracked sleep, the median number of days tracked is about half that of activity tracking (20.5 vs. 42 days)
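The percentages in this list follow from the counts and medians reported in the output above. The arithmetic, using only those reported numbers (the variable names are mine):

```r
users_total     <- 35    # users with activity data
users_sleep     <- 24    # users with at least one sleep record
median_sleep    <- 20.5  # median sleep days, among sleep trackers
median_activity <- 42    # median activity days

pct_tracking_sleep <- users_sleep / users_total * 100                  # about 68.6
pct_never_tracked  <- (users_total - users_sleep) / users_total * 100  # about 31.4
median_ratio       <- median_sleep / median_activity                   # about 0.49
```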

# Viz2: Activity vs. Sleep Tracking Comparison
tracking_long <- tracking_comparison %>%
  select(Id, tracking_days, sleep_days) %>%
  pivot_longer(cols = c(tracking_days, sleep_days),
               names_to = "tracking_type",
               values_to = "days") %>%
  mutate(tracking_type = ifelse(tracking_type == "tracking_days", 
                                "Activity Tracking", 
                                "Sleep Tracking"))

# Box plot comparison
ggplot(tracking_long, aes(x = tracking_type, y = days, fill = tracking_type)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Activity Tracking" = "#2E86AB", 
                                "Sleep Tracking" = "#6A4C93")) +
  labs(title = "Tracking Consistency: Activity vs Sleep",
       subtitle = "Sleep tracking is significantly less consistent",
       x = "",
       y = "Days Tracked") +
  theme_minimal() +
  theme(legend.position = "none")

The Dramatic Difference: The box plot visualization clearly shows activity tracking clustered around 40 days, while sleep tracking has a much wider distribution with many users at zero or very low numbers.

Consistency vs. Activity Level

# Does tracking consistency correlate with activity?
ggplot(user_tracking, aes(x = tracking_days, y = avg_steps)) +
  geom_point(color = "#2E86AB", size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", color = "#E63946", se = TRUE) +
  labs(title = "Does Consistent Tracking Lead to More Steps?",
       subtitle = paste("Correlation:", 
                       round(cor(user_tracking$tracking_days, user_tracking$avg_steps), 3)),
       x = "Days Tracked",
       y = "Average Daily Steps") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# Calculate correlation
cor(user_tracking$tracking_days, user_tracking$avg_steps)
## [1] 0.330517

Engagement Drives Results

Critical Finding: Tracking consistency correlates positively with activity levels (r = 0.33).

Users who tracked for 40+ days averaged ~8,000 steps, while those tracking fewer than 20 days averaged ~4,000 steps.

This pattern is consistent with the following, although a correlation alone cannot establish cause:

- Consistent tracking creates accountability

- Regular data viewing motivates behavior change

- Habit formation requires sustained engagement
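A group comparison like the one above (40+ days vs. fewer than 20 days) can be produced by bucketing users on days tracked and averaging within each bucket. This is a sketch on invented data (`toy_tracking` is hypothetical, and the bucket boundaries are my assumption); with the real data, user_tracking would take its place.

```r
# Toy per-user summary (invented values)
toy_tracking <- data.frame(
  tracking_days = c(8, 15, 25, 38, 42, 42, 45, 62),
  avg_steps     = c(3500, 4500, 6000, 7000, 8200, 7800, 9000, 4100)
)

# Bucket users by days tracked: [0,20), [20,40), [40,Inf)
toy_tracking$bucket <- cut(
  toy_tracking$tracking_days,
  breaks = c(0, 20, 40, Inf),
  labels = c("<20 days", "20-39 days", "40+ days"),
  right  = FALSE
)

# Mean of avg_steps and number of users per bucket
bucket_means  <- tapply(toy_tracking$avg_steps, toy_tracking$bucket, mean)
bucket_counts <- table(toy_tracking$bucket)

bucket_means
bucket_counts
```

Reporting the user count next to each mean matters here, because a bucket with only a few users can swing the comparison.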

EXECUTIVE SUMMARY & ACTIONABLE RECOMMENDATIONS

Bellabeat Case Study: Key findings & strategic recommendations

Executive Summary

Analysis of 35 Fitbit users’ activity, sleep, and usage patterns reveals critical opportunities for Bellabeat to differentiate itself in the wellness technology market. My findings challenge conventional wisdom about sleep-activity relationships while identifying clear user segments and engagement gaps.


Top 5 Critical Findings

  1. The Sedentary Crisis (16.7 hours/day): Users are sedentary for over 16 hours daily but achieve only 34 minutes of moderate-to-vigorous activity. This represents the primary health risk and biggest intervention opportunity.

  2. Majority of Users Need Support (57%): 31% are sedentary and 26% are lightly active. Together they represent the core target audience who need motivation and behavioral change support.

  3. Sleep-Activity Paradox: More sleep does NOT lead to more activity (r = -0.19). In this data, sleep and activity show little relationship, so they call for separate interventions, not a simple “sleep more → move more” message.

  4. Sleep Tracking Engagement Gap: Only 69% of users track sleep, and those who do track it on about half as many days as activity (median 20.5 vs. 42 days). This represents a major UX challenge.

  5. Engagement Drives Results: Consistent tracking correlates with higher activity (r = 0.33). Users tracking 40+ days average 8,000 steps vs. 4,000 steps for those tracking <20 days.


Strategic Recommendations for Bellabeat

PRIORITY 1: Combat Sedentary Behavior

The Problem: 16.7 hours/day sedentary time is unsustainable and unhealthy.

Solutions:

  • Hourly Movement Reminders: Prompt sedentary users to move, for example, “You’ve been sitting for 50 minutes - time for a 2-minute walk!”
  • Sedentary Time Tracking: Make sitting as visible as steps, for example, “You sat for 14 hours today. Get up and move!”
  • Set Micro-Movement Goals: “Take 250 steps every hour” instead of overwhelming 10,000-step goals
  • 3 PM Wake-Up Call: Target the afternoon activity dip with energizing challenges

PRIORITY 2: Segment-Specific Engagement

The Problem: 57% of users are sedentary or lightly active and need different support than the 20% who are very active.

So here’s the recommendation for every category:

  1. Sedentary Users (31%) - “The Beginners”
  • Start with 3,000-5,000 step goals
  • Celebrate ANY increase, for example, “You walked 500 more steps than yesterday!”
  • Focus on consistency over volume
  • Gentle, encouraging tone
  2. Lightly Active Users (26%) - “The Strivers”
  • Progressive goals: 5K → 7.5K → 10K steps
  • Weekly challenges with achievable targets
  • Social features for accountability
  • Highlight health benefits of incremental improvements
  3. Fairly Active Users (23%) - “The Almost-Theres”
  • Push toward 10K steps with targeted motivation
  • Introduce intensity tracking (not just steps)
  • Competitive leaderboards
  • Reward consistency streaks
  4. Very Active Users (20%) - “The Athletes”
  • Advanced analytics (VO2 max, recovery metrics)
  • Training plans and periodization
  • Performance optimization insights
  • Community mentorship opportunities

PRIORITY 3: Fix Sleep Tracking UX

The Problem: 31% of users never track sleep; those who do track it on about half as many days as activity.

Solutions:

  1. Auto-Sleep Detection: Remove the need to manually activate sleep mode

  2. Charging Solutions:

    • Rapid charging (80% in 30 minutes during morning routine)

    • Alternative: Two devices (wear one while charging the other)

  3. Comfort First: Smaller, lighter sleep-specific device or redesigned band

  4. Demonstrate Value: Show “Sleep Score” and next-day energy predictions immediately

  5. Bedtime Reminders: “To get 8 hours, go to bed by 10:30 PM”

PRIORITY 4: Optimal Timing for Interventions

The Problem: Users have distinct activity patterns that current generic notifications ignore.

Time-Based Strategy:

  • 5-7 PM (Peak Activity): Send workout challenges, achievement celebrations
  • 12-1 PM (Lunch Peak): “Take a 10-minute walk after lunch”
  • 3 PM (Afternoon Dip): “Energy slump? 5-minute movement break!”
  • 11:30 AM: Pre-lunch walk reminders
  • 9 PM: Wind-down routine prompts for better sleep
  • Avoid 8-10 AM: Commute chaos, low engagement window

PRIORITY 5: Emphasize Intensity Over Volume

The Problem: Users fixate on steps but overlook intensity, even though very active minutes correlate with calories at least as strongly as step count does (r = 0.59 vs. 0.58).

Solutions:

- Intensity Zones: Display time in heart rate zones prominently

- Quality Over Quantity: “20 minutes of vigorous activity > 20,000 light steps”

- 10-Minute HIIT Challenges: For time-constrained users

- Distance + Intensity Metrics: Track both steps and “active distance”

- Reframe Success: “You hit your intensity goal!” not just “You hit your steps!”

PRIORITY 6: Build Habit Formation Features

The Problem: Tracking consistency predicts success, but many users don’t build lasting habits.

Solutions:

  • Onboarding Excellence: First 7 days are critical - daily check-ins and wins
  • Streak Tracking: “You’ve tracked 15 days in a row! Keep it going!”
  • 30-Day Milestone: Major celebration and reward at 1 month
  • Habit Stacking: “Track your morning steps right after your coffee”
  • Social Accountability: Share streaks with friends
  • Recovery Paths: “You missed yesterday - let’s get back on track today”

PRIORITY 7: Separate Sleep & Activity Messaging

The Problem: No positive correlation between sleep and activity challenges the “sleep better → move more” narrative.

New Approach:

  • Independent Value Props: Promote each for its own benefits
  • No False Promises: Don’t claim sleep will directly increase activity
  • Holistic Wellness: “Great sleep AND great activity = optimal health”
  • Different Motivations: Some users prioritize sleep, others prioritize fitness - honor both
  • Targeted Messaging:
    • Active + Poor Sleep: “Protect your performance with recovery”
    • Sedentary + Good Sleep: “You’re well-rested - let’s move!”

Conclusion

The path to better health isn’t about revolutionary changes; it’s about sustainable, personalized, and consistent small improvements. Bellabeat’s opportunity lies in meeting users where they are, understanding their unique challenges, and providing intelligent, compassionate support that respects the complexity of real human behavior.

The winning formula: Segment intelligently + Reduce friction + Time interventions + Celebrate progress + Build habits.

That’s it from me for the Capstone project in the Bellabeat Case Study. Thank you so much for your interest in the project!

Alifia Ganjaraharja